Abstract

Recent work has focused on the problem of conducting linear regression when the number of
covariates is very large, potentially greater than the sample size. To facilitate this, one useful tool
is to assume that the model can be well approximated by a fit involving only a small number of
covariates - a so-called sparsity assumption, which leads to the Lasso and other methods. In many
situations, however, the covariates can be considered to be structured, in that the selection of some
variables favours the selection of others - with variables organised into groups entering or leaving
the model simultaneously as a special case. This structure creates a different form of sparsity.
In this paper, we suggest the Co-adaptive Lasso to fit models accommodating this form of 'group
sparsity'. The Co-adaptive Lasso is fast and simple to calculate, and we show that it holds theoretical
advantages over the Lasso, performs well under a broad set of conditions, and is very competitive
in empirical simulations in comparison with previously suggested algorithms like the Group Lasso
(Yuan and Lin, 2006) and the Adaptive Lasso (Huang et al., 2008).



1 Introduction 



Consider the standard linear regression problem, where for known $X \in \mathbb{R}^{n \times p}$,
$Y \in \mathbb{R}^n$, we assume the model

$$Y = X\beta + \epsilon$$

with $\epsilon$ an independent noise term. We fit the model, then, by trying to estimate $\beta$.

Many modern datasets have a high dimensionality $p$, in that they have a large number of variables - often
more than the number of observations. Under this scenario, problems become very difficult. However, in
recent years, it has emerged that by introducing the concept of sparsity - the a priori assumption that
$\beta$ has only $\beta_S \neq 0$, with $|S| \ll n$ - methods can deal with such cases with both reasonable
accuracy and acceptable computational requirements.



Of particular note is the Lasso (Tibshirani, 1996), which chooses, for any value of a tuning parameter
$\lambda > 0$,

$$\operatorname{argmin}_{\beta} \|Y - X\beta\|^2 + \lambda \sum_{k=1}^{p} |\beta_k|.$$



The Lasso has been proven to have a variety of properties, many very favourable (van de Geer and
Bühlmann, 2009; Zhao and Yu, 2006; Wainwright, 2009), and fast computational schemes have been
constructed (Osborne et al., 2000; Friedman et al., 2007). Extensions have also been proposed - in particular,
the Adaptive Lasso, which reweights the L1 Lasso penalty according to an initial estimator, can have
very good performance in some contexts (Huang et al., 2008; van de Geer et al., 2010). Nevertheless, in
some situations, the sparsity assumption on which the Lasso is based may not be sufficient, or there may
simply not be enough data or too many unimportant covariates for the Lasso to perform very well.
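As a rough illustration of how a Lasso fit can be computed, the sketch below runs cyclic coordinate descent on a rescaled form of the Lasso criterion. It is not the LARS or pathwise scheme cited above; the function names, the 1/(2n) scaling of the squared error (which only rescales the tuning parameter) and the fixed iteration count are all assumptions made for this example.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t (the proximal map of t * |.|)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1 / (2n)) * ||y - X b||^2 + lam * sum_j |b_j|.

    X is a list of n rows, each a list of p floats (assumed layout)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # correlation of column j with the partial residual (column j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy orthogonal design: each coordinate is shrunk separately.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
print(lasso_cd(X, y, lam=0.5))
```

On an orthogonal design such as this one, each update reduces to soft-thresholding a single coordinate, so a sufficiently small coefficient is set exactly to zero.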

A key case we consider in this paper is the case of group sparsity. Under group sparsity, our data is 
augmented with a grouping structure, and we believe that not only is the data somewhat sparse, but 
that the grouping structure we have provides information on the patterns of sparsity that are plausible. 
Typically, we believe that variables in the same group are likely to be simultaneously relevant or irrelevant. 



Such a scenario occurs commonly in a very wide set of contexts. For example, when dealing with 
covariates that take discrete levels, constructing a design matrix with dummy variables for each level 
implies that having an original covariate be irrelevant is equivalent to having all its corresponding dummy 
variables be simultaneously irrelevant. 
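A minimal sketch of this construction, assuming full dummy coding with no reference level dropped. The function name `dummy_encode` and its return format are hypothetical and chosen only for illustration.

```python
def dummy_encode(columns):
    """One-hot encode categorical covariates.

    columns: list of covariates, each a list of n sortable levels.
    Returns (X, groups): X is a list of n rows of 0/1 floats, and
    groups[k] lists the (0-based) design-matrix columns generated by
    original covariate k."""
    n = len(columns[0])
    X = [[] for _ in range(n)]
    groups = []
    for col in columns:
        levels = sorted(set(col))
        start = len(X[0])  # first design column used by this covariate
        for i in range(n):
            X[i].extend(1.0 if col[i] == lev else 0.0 for lev in levels)
        groups.append(list(range(start, start + len(levels))))
    return X, groups


X, groups = dummy_encode([["a", "b", "a"], ["x", "y", "z"]])
print(groups)  # one group of dummy columns per original covariate
print(X[0])
```

Dropping an original covariate from the model then corresponds to zeroing every column in its group at once.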



Existing work has addressed this problem mostly via the Group Lasso (Yuan and Lin, 2006), which for
groups $G_1, \ldots, G_q$ and some tuning parameter $\lambda$ is defined as the estimator

$$\hat\beta_\lambda = \operatorname{argmin}_{\beta} \frac{1}{2}\|Y - X\beta\|^2 + \lambda \sum_{k=1}^{q} \|\beta_{G_k}\|.$$



Properties of the Group Lasso have been investigated by authors such as Huang and Zhang (2010), and
computational methods have been investigated by Meier et al. (2008) and Roth and Fischer (2008).
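To make the group penalty concrete, the sketch below evaluates the sum of group-wise Euclidean norms and applies the corresponding block soft-thresholding step, which is the proximal map of that penalty. The function names and the 0-based index convention are assumptions for this example; it is not the computational scheme of any of the papers cited above.

```python
import math


def group_penalty(beta, groups):
    """Sum over groups of the Euclidean norm of the coefficients in each group."""
    return sum(math.sqrt(sum(beta[j] ** 2 for j in g)) for g in groups)


def group_soft_threshold(v, t):
    """Proximal map of t * ||v||: shrink the whole block v towards zero.

    The block is either rescaled by one common factor or zeroed as a whole,
    so its entries enter or leave the model together."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return [0.0] * len(v)
    scale = 1.0 - t / norm
    return [scale * x for x in v]


beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]  # 0-based column indices (assumed convention)
print(group_penalty(beta, groups))
print(group_soft_threshold([3.0, 4.0], 2.5))  # block kept, shrunk jointly
print(group_soft_threshold([3.0, 4.0], 6.0))  # block removed entirely
```

Because the threshold acts on the norm of the whole block, no individual coefficient inside a retained block is set to zero.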



However, the Group Lasso has a number of shortcomings. Firstly, it provides no room for sparsity within
groups - variables belonging to a group are either all selected, or all unselected simultaneously. But the
simultaneous presence of within and among group sparsity could indeed be preferable - for instance, when
the group specification is not completely accurate, or from subject beliefs, or when we wish a sparsely
representable signal. Secondly, it deals with overlapping groups in a way that might seem unnatural
- rather than selecting sparsity patterns that are unions of groups, it chooses patterns that are the
complements of such a union. Finally, the computational requirements of the Group Lasso may still be
above those of the Lasso. In particular, exact path solutions through LARS schemes are available only for
the Lasso due to the particular piecewise linearity of Lasso solutions, and non-sparsity of signals in the
Group Lasso can require many complex L2 projection steps, slowing down computations and requiring
more memory. Even in online algorithms, such as Yang et al. (2010), to the author's knowledge, current
methods show a gap in computational speed between the L1 method and Group Lasso based calculations.



A variety of previous work has been done to attempt to solve these issues. Friedman et al. (2010) used
an additive combination of the Group Lasso and Lasso penalties, while Jacob et al. (2009) modified
the group penalty to deal with overlapping groups. Many of these procedures introduce their own 
problems, however. In particular, several algorithms introduce additional tuning parameters, requiring 
multidimensional grid searches to optimise for them, and hence greatly increase the computational cost. 

In this paper, we adopt a different approach. Instead of using the Group Lasso penalty, we
modify the Adaptive Lasso so that the weights calculated from the initial estimate take
account of the grouping effect. We call this new estimation procedure the Co-adaptive Lasso, as it is a
variant of the Adaptive Lasso that shares information between estimates of the coefficients.



A few other authors have independently produced approaches similar to the Co-adaptive Lasso. Breheny
and Huang (2009) defined a variety of non-concave grouping penalties, together with the LCD algorithm,
which is similar to a repeated version of the co-adaptive procedure with a constant tuning parameter.
Zhou and Zhu (2010) began from a very different rationale, and also produced a similar algorithm, though
again focusing on finding local minima for a criterion function through an iterative algorithm. More
broadly, in non-overlapping group cases, the form of the Co-adaptive Lasso is similar to stopped versions
of the LLA algorithm of Zou and Li (2008), for a suitably chosen penalty function. Alahi et al. (2011) also
proposed a similar scheme as the O-Lasso, albeit for a very specific and dramatically different context.

Our contribution in this paper is that we focus on finite (in particular, two-stage) procedures, and prove 
their performance qualities, regardless of how - and indeed, whether - the algorithm would converge 
under iteration. This allows us to avoid potential issues where good properties for the global minimum 
can be proven, but convergence to such a minimum cannot. We separately and sequentially choose 
the tuning parameter at each stage of the algorithm, thereby producing an algorithm with equivalent 
computational cost to the Lasso itself. We also address overlapping groups and within group sparsity. 

In Section 2 we define some notation, as well as the Co-adaptive Lasso itself. In Section 3 we give the
main results of the paper and some broad comparisons with related algorithms. We follow in Section 4
with more detailed properties. We discuss overlapping group structures in Section 5. Finally, in Section 6
we compare the Co-adaptive Lasso to other methods in simulations, and end with a discussion of further
work.






2 Notation and Definitions 



Let $Y \in \mathbb{R}^n$ be the response vector. Assume, subtracting a constant intercept term if necessary, that
$\sum_{i=1}^{n} Y_i = 0$. $X = (X^{(1)}, \ldots, X^{(p)}) \in \mathbb{R}^{n \times p}$ is the matrix composed of covariate column vectors, which
we assume also to have mean zero. Hence, $n$ is the sample size, and $p$ is the number of covariates. Let
$\|\cdot\|_n$ denote the empirical L2 norm, and $\|\cdot\|$ be the standard L2 norm, with $\|\cdot\|_1$ the L1 norm.

Note that throughout, for clarity, we use lower-case Roman letters $s$ to denote scalar or vector quantities,
capitals $S$ to denote sets or matrices, and script letters $\mathcal{S}$ to denote sets of sets.

In our problem, the covariates have an a priori known group structure. We denote this by

$$\mathcal{G} = \{G_j\}_{j=1}^{q}, \quad \text{with} \quad \bigcup_{j=1}^{q} G_j = \{1, \ldots, p\},$$

which defines the membership indicators of each group. We say that $\mathcal{G}$ is non-overlapping if its elements
are all disjoint.

For any subset $G \subseteq \{1, \ldots, p\}$, not necessarily a member of $\mathcal{G}$, we then denote by $X_G$ the columns of
$X$ corresponding to the indices in $G$. Similarly, for a vector $v$, say, we denote by $v_G$ the terms of $v$
corresponding to the indices in $G$. We denote by $v^+$ and $v^-$ the maximum and minimum value of $v$
respectively.

For any set $S$, we denote by $\mathcal{G}_{\cap S} = \{G \in \mathcal{G} : G \cap S \neq \emptyset\}$, and $\mathcal{G}^c_{\cap S} = \{G \in \mathcal{G} : G \cap S = \emptyset\}$. We write $\bar S$
to be $S$ together with its in-group neighbours - that is, $\bar S = \bigcup \mathcal{G}_{\cap S}$.

For $S = \{j \in \{1, \ldots, p\} : \beta_j \neq 0\}$, $S$ is group sparse if $\mathcal{G}_{\cap S}$ is a small subset of $\mathcal{G}$. We say the group
structure is evenly sized if each $G \in \mathcal{G}$ has the same size, and the problem is all-in-all-out (AIAO) if $S$
can be expressed exactly as a union of a small number of sets in $\mathcal{G}$.
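The set notation above can be illustrated with a few lines of code. This is only a sketch: the function names are hypothetical, indices are 0-based rather than 1-based, and the AIAO check tests whether S is exactly a union of groups without trying to quantify "a small number".

```python
def groups_hitting(groups, S):
    """Indices k of the groups G_k that have a non-empty intersection with S."""
    S = set(S)
    return [k for k, g in enumerate(groups) if S & set(g)]


def closure(groups, S):
    """S together with its in-group neighbours: the union of all groups hitting S."""
    out = set()
    for k in groups_hitting(groups, S):
        out |= set(groups[k])
    return out


def is_all_in_all_out(groups, S):
    """True when S is exactly a union of groups."""
    return closure(groups, S) == set(S)


groups = [[0, 1], [2, 3], [4]]   # a non-overlapping grouping of 5 covariates
S = {0, 4}                       # indices of the non-zero coefficients
print(groups_hitting(groups, S))
print(sorted(closure(groups, S)))
print(is_all_in_all_out(groups, S), is_all_in_all_out(groups, {2, 3}))
```

Here covariate 1 is an in-group neighbour of covariate 0, so it belongs to the enlarged set even though its own coefficient is zero.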

Recall that for a tuning parameter $\lambda > 0$, the Lasso (Tibshirani, 1996) estimate is defined as

$$\hat\beta_\lambda = \operatorname{argmin}_{\beta} \frac{1}{2}\|Y - X\beta\|_n^2 + \lambda\|\beta\|_1. \qquad (2.1)$$

Definition 2.1. Let $\mu > 0$. Suppose $\hat\beta_\lambda$ is a Lasso solution for $X, Y$, for some appropriately chosen
tuning parameter value $\lambda$. Then, for $\mathcal{G}$ non-overlapping, we define the Co-adaptive Lasso weights, for a
given covariate $j \in G$, as

(2.2) 



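The Co-adaptive Lasso solution is then

$$\hat\beta^{(c)}_{\lambda,\mu} = \operatorname{argmin}_{\beta} \frac{1}{2}\|Y - X\beta\|_n^2 + \mu \sum_{j=1}^{p} w_j |\beta_j|. \qquad (2.3)$$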



In the case of overlapping groups, a range of different weight formulations may be considered, depending
on the type of overlap and signal sparsity pattern. This we will delay until Section 5.

If each group contains only one covariate, then the weight calculation we have here is identical to the
standard implementation of the Adaptive Lasso (Zou, 2006).

In general, we will suppress the subscripts $\lambda$ and $\mu$ in our notation for $\hat\beta$.
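The two-stage structure of Definition 2.1 can be sketched as follows. The exact weight formula (2.2) is not reproduced in this text, so the weight used here, $\sqrt{|G|} / \|\hat\beta_G\|$ shared by every $j \in G$, is an assumption chosen only because it reduces to the Adaptive Lasso weight $1/|\hat\beta_j|$ when every group is a singleton. The function names, the 1/(2n) scaling and the coordinate descent solver are likewise illustrative choices, not the paper's implementation.

```python
import math


def soft_threshold(z, t):
    """Shrink z towards zero by t; an infinite t always returns 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/(2n)) ||y - X b||^2 + lam * sum_j weights[j] |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def coadaptive_weights(beta_init, groups):
    """ASSUMED weight form: sqrt(|G|) / ||beta_init restricted to G|| for all j in G.

    A group whose initial estimate is entirely zero gets an infinite weight."""
    w = [0.0] * len(beta_init)
    for g in groups:
        norm = math.sqrt(sum(beta_init[j] ** 2 for j in g))
        for j in g:
            w[j] = math.sqrt(len(g)) / norm if norm > 0 else math.inf
    return w


def coadaptive_lasso(X, y, groups, lam, mu):
    """Stage 1: plain Lasso with parameter lam. Stage 2: reweighted Lasso with mu > 0."""
    p = len(X[0])
    beta_init = weighted_lasso(X, y, lam, [1.0] * p)
    w = coadaptive_weights(beta_init, groups)
    return weighted_lasso(X, y, mu, w)


# Toy orthogonal design with two groups; the second group is pure noise.
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [4.0, 2.0, 0.4, 0.2]
print(coadaptive_lasso(X, y, groups=[[0, 1], [2, 3]], lam=0.2, mu=0.1))
```

In this toy case the noise group is removed by the first stage and receives an infinite weight, while the retained group shares one small weight and is shrunk less in the second stage than in the first.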



2.1 Restricted eigenvalues and the Lasso 

Key to the performance of the Lasso and most variants of it are conditions on the covariance matrix,
in particular its restricted eigenvalue, or compatibility, properties (van de Geer and Bühlmann, 2009).
Broadly speaking, these properties measure the minimal eigenvalues or generalised eigenvalues of the
matrix under some set of restrictions. Failure of the relevant condition implies that there exist feasible
alternative solutions that give the same fitted values, thus implying the failure of the algorithm.

In the Lasso case, we define the restricted eigenvalue as, 

2 (L,5,to) = mm f |f% : M D S, \M\ <m,||<M|i < MWs|| ] , 

S - M V W° M W J 

with 4> 2 {L : S) =<I> 2 (L,S,\S\). 



In van de Geer and Bühlmann (2009) and van de Geer et al. (2010), a slightly different form is given,
but the above is equivalently applicable. A further variant is available in P. J. Bickel (2009). We say the
RE$(L, S, m)$ condition holds if $\phi^2(L, S, m) > 0$.



The usual results for the standard Lasso in this case are as follows.

Lemma 2.1. Let

$$\lambda > 2\max|X'\epsilon/n|.$$

Then the Lasso estimate $\hat\beta_\lambda$ satisfies



X(3 X - Xp 



2 14A 2 |5| 
n - </> 2 (6, 5) ' 

< 7Ay^ 
" 03(6,5)' 



Further, 



< 



28A^ 
2 (6,5,2|5|)- 



A proof of this is available in Theorem 7.1 of van de Geer et al. (2010). Similar theorems have been
proven by other authors, including P. J. Bickel (2009). Use of compatibility conditions (van de Geer and
Bühlmann, 2009) will yield an L1 and prediction error convergence result under similar conditions. The
choice $L = 6$ in the instances of $\phi$ can usually be replaced with other values of $L > 1$, at the price of
changing the constants in the bounds.

The Adaptive Lasso in general uses the same conditions. In the Group Lasso, a similar RE property is
required, but which uses the group L2 norms instead of the L1 norm in the restriction, and furthermore
restricts the considered sets $M$ to be combinations of groups. This second point is one of the reasons
that the required conditions for the Group Lasso are somewhat weaker than for the Lasso in the case of
L2 estimation.



2.2 Conditions for the Co-adaptive Lasso 



For our work, we introduce an additional variant of the RE property. 
Definition 2.2. We define the group restricted RE statistic as 

' \X6\\l 



r g (L,S) := mm : M G S, ||<Mi < W\S\ \\5 S \ 



Lemma 2.2. Let $\mathcal{H} \subseteq \mathcal{G}_{\cap S}$ be any covering set of $S$. We have inequalities:

$$|\mathcal{H}|\,\phi^2(L, S) \ge \phi_g^2(L, S) \ge \min_{G} \phi^2(L, S, |S| + |G|) \ge \phi^2(L, S)/(1 + L^2|S|).$$
Now, the second inequality in particular can be very loose if $|G|$ is large, because we consider a much
restricted set of subsets. In addition, presence of a few small groups may also not matter as much as
it appears above, because these groups may not coincide with directions where $\delta$ can be large without
increasing $\|X\delta\|$ greatly. In particular, $\phi_g$ is now comparable with the Group Lasso version of the RE
property. We note that the form of our inequalities is distinguished from the form in van de Geer et al.
(2010) in that $\phi^2(L, S, m) > 0$ if and only if $\phi^2(L, S) > 0$.



Definition 2.3. The group restricted RE properties lead to the definition of the following conditions,
which are useful for our results.

Condition A1:

There exists $C > 0$ such that for all sufficiently large $n$,

$$\phi^2(3, S) > C.$$

Condition A2:

There exists $C > 0$ such that for all sufficiently large $n$,

$$\phi_g(3, S) > C.$$



For the co-adaptive reweighting procedure to be of benefit asymptotically, we require in addition
conditions on the dimensionality of the problem and level of noise relative to the size of the signal and
sample size, to ensure that the initial Lasso does not do too badly.

Condition B1:

The noise $\epsilon$ is independent normal, with variance less than $\sigma^2$. $\|X^{(j)}\|$ is bounded. Without loss of
generality, assume, rescaling if necessary, that it is identically equal to 1.

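Condition B2:

There exist $\gamma_1 > 0$, $\gamma_2 > 0$ such that

$$\max_{G} \frac{\sigma^2 |S| \log(p)\, |G|^{\gamma_1}}{n \max\left(\|\beta_G\|^2, \min_{H \in \mathcal{G}_{\cap S}} \|\beta_H\|^2\right)} = o\left(n^{-\gamma_2}\right).$$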

The A and B conditions together imply certain convergences in the co-adaptive weights. A final set of 
conditions govern the convergence of the second stage: 

Write 

\G\ 



L\ = max 



GGSns, V 

Condition C1:

(a): Given condition B2, there exists $C > 0$, $\delta > 0$ such that

cj) 2 (6L\,S\ > C. 

(b) : Further, for some T D S, 



2 (SL° lt S,\S\ 



T\ > C. 



Condition C2: 

(a): There exists $C > 0$, $\delta > 0$ such that there exists $T$, $S \subseteq T \subseteq \bar S$, satisfying

/ / 



dLV, max 2(1 + 0) —== ,1 > C. 

GGSns, \\Pg\\/V\G\ ' " 
(b): Further,

/ 



max 



V 



6L° max 2(l + 5) ^ H ^ v pi I ,T,2\T\ I > C. 

GSSns, V ' \\P G \\ / y ^ 1 1 

\ H ^n(S\T) 




Remark 2.1. Conditions A1 and A2 are restricted eigenvalue type conditions to ensure the Lasso performs
sufficiently well in the first stage. They generally hold so long as the covariates are not too highly
correlated, and $|S|$ and $p$ are not too large relative to $n$. Generally, these conditions are a bit weaker
than those required for the Lasso.

Conditions B1 and B2 govern the noise level and scaling of the various dimensions of the problem. The
normality assumption in B1 can be trivially relaxed to any general subgaussian noise distribution. B2
holds, for example, in any case where the Lasso itself converges and $\min \|\beta_G\|$ does not decrease. In the
case where, in addition, the expected sizes of the individual coefficients remain constant as $|G|$ increases,
B2 holds with $\gamma_1 \ge 1$, $\gamma_2 > 0$. Note that many bounds in this article may hold even if B2 fails. However,
the bounds may no longer be useful.

C1 and C2 are restricted eigenvalue type conditions for the second stage. They are similar to those
in A1-A2, but include allowances for, on the plus side, the first stage's success in removing irrelevant
groups, and on the negative side, differences between the groups that make it difficult to identify relevant
or irrelevant variables. C1 is most useful in cases where there is little within-group sparsity, while C2
allows success when the group sizes are too large for C1 to be satisfied, as long as there is substantial
within-group sparsity.

Relations exist between the conditions. The (b) parts of C1 and C2 imply their corresponding (a)
parts. By Lemma 2.2, if $\min\{|\mathcal{H}| : \bigcup \mathcal{H} \supseteq S\}$ remains bounded, then A2 implies A1. Further, if
RE$(3, S, |S| + \max|G|)$ is satisfied, then both conditions A1 and A2 are satisfied. The main theorems in
this article require the satisfaction of all the A and B conditions, plus one of condition C1 or C2.

Condition C1 simplifies in the case where the groups are evenly sized, in which case we require only
that $\phi^2(L, S)$ is bounded for some fixed $L$ for this to eventually automatically be fulfilled. Indeed,
in this evenly sized case, if $\phi^2(3, S, 2|S|)$ is bounded, then conditions C1 and A1-2 are satisfied, since
$\max|G| \le |S|$. In the evenly sized groups, AIAO context, then, satisfaction of the conditions for the
Lasso implies the conditions for the Co-adaptive Lasso are satisfied.

In this case of evenly sized groups, for C2, the condition becomes dependent on the ratio $\|\beta_H\| / \|\beta_G\|$.
Noting that this ratio is always greater than or equal to 1, satisfying the condition becomes a trade-off between
choosing $T$ large enough so that this ratio is kept small, and small enough so that restricted eigenvalues
do not fall too low. Because of the role of $|T|$ in our later theorems, condition C2 is most useful when a
$T$ can be found with size $|T| = O(|S|)$.



3 Main results 

For simplicity, we will focus on the case of non-overlapping groups. Overall, due to the re-weighting
nature of the algorithm, the Co-adaptive Lasso necessarily inherits some of the properties of the Adaptive
Lasso (van de Geer et al., 2010). Whenever the conditions necessary for the Adaptive Lasso are
fulfilled, for example, the adaptive weights must successfully distinguish between the zero and the
non-zero coefficients, with the weights on the non-zero coefficients converging to a negligible fraction of the
weights on the zero coefficients. In this instance, the co-adaptive weights, as an aggregation of adaptive
weights within each group, must also successfully distinguish between zero and non-zero groups, so the
second Lasso step converges to a Lasso performed on the subset of covariates that belong to non-zero
groups, instead of the whole $p$-dimensional dataset. In the asymptotic setting of the sample size increasing
to infinity, taking $\mu \to 0$, we obtain results in the second stage similar to an ordinary linear regression
conducted only on the relevant covariate groups, implying that the Co-adaptive Lasso is consistent
for AIAO sparsity in the cases where the Adaptive Lasso succeeds. However, there may be situations
where the Co-adaptive Lasso succeeds though the Adaptive Lasso does not, or at least attains a better
performance.
The following theorem gives some asymptotic bounds:



Theorem 3.1. Suppose conditions A1-2 and B1-2 are satisfied with $\mathcal{G}$ non-overlapping. If for some $T$
with $S \subseteq T \subseteq \bar S$, either C1(a) or C2(a) is satisfied, then, writing



/i = max 



L = max 



V^tp)L°^\og\S\ max 



'n(S\T) 




, max 



Gee ns , ||/? G |IVI^I 

\ ffee n(S\T) 



taking the latter part of the maximisation equal to 0 if $\bar S \setminus T$ is empty, we have that for any $\eta > 0$, there
exists $\lambda$, $\mu$ such that


X (p° LS > T - 



•o I 2 I - 1 I ~2 



JT[ 
n 

ITI 



n 



(3.1) 

(3.2) 
(3.3) 



with probability exceeding $1 - \eta$. Here $\hat\beta^{OLS,T}$ represents the ordinary least squares solution restricted to
the covariate set $T$.

Further, if either condition C1(b) or C2(b) holds, simultaneously

2 / _ ITI - / ~\ 2 



P 



OLS,T S(c) 



P~P 



= ()[a 2 ^(l //(l + I 



(3.4) 
(3.5) 



In short, so long as the initial Lasso can be guaranteed to not perform too badly, making allowances
for a limited set of variables $T \setminus S$ where it is impossible to distinguish relevant from irrelevant, the
Co-adaptive Lasso performs within a multiplicative factor of the optimal result, with this factor being
dependent on variability in group sizes and coefficient group norms.

In some common situations, Theorem 3.1 can be simplified greatly with some additional assumptions.

Corollary 3.2. Suppose that $\mathcal{G}$ is non-overlapping, and conditions A1-2 and B1-2 are satisfied. Suppose
additionally that

$$\max_{H \in \mathcal{G}^c_{\cap S}} |H| = O\left(\min_{G \in \mathcal{G}_{\cap S}} |G|\right)$$

and there exists $T \supseteq S$ with $|T| = O(|S|)$ so that C1(a) or C2(a) is satisfied and



Heg 



max \\p H \\/\H\ = I min \\[3 G \\ /\G\ 



n(S\T) 



G6S r 



Then for any $\eta > 0$, there exists $\lambda$, $\mu$ such that for any $G$,



X 



(/3-^ (c) ) 



= 1 cr^maxl *~° ]= " , log \S\ 



log IS | 



with probability exceeding $1 - \eta$.

If C1(b) or C2(b) is satisfied then simultaneously

2 =0[o* 

n 



P~P 



'(c) 



2\S\ ( log |g| . l5l 
= O ( a — max | — — , log \S\ 



\G\^n~f2 



7 



Corollary 3.3. Suppose that $\max_{H \in \mathcal{G}^c_{\cap S}} |H| = O(\min_{G \in \mathcal{G}_{\cap S}} |G|)$, and $\mathcal{G}$ is non-overlapping. Then if
conditions B1-2 and A2 are satisfied and A1 is satisfied replacing $S$ with $\bar S$, then, for any $\eta > 0$, there exist
$\lambda$, $\mu$ such that for any $G$,



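$$\|X(\beta - \hat\beta^{(c)})\|_n^2 = O\left(\sigma^2 \frac{|\bar S|}{n}\left(1 + \sqrt{\log|\mathcal{G}|\,|G|^{-\gamma_1} n^{-\gamma_2}}\right)^2\right)$$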



with probability exceeding $1 - \eta$.

If in addition there exists fixed $L$, $C$ such that $\phi(L, S, 2|S|) > C$, then it is simultaneously the case that



[3-/3 



■(c) 



O ( a 2 ^(l + y/log\g\\G\-Kn-T) 



— 7-j ^2 



In particular, Corollary 3.3, in the case of AIAO signals, requires conditions that are the same or weaker
than those of the ordinary Lasso or Adaptive Lasso.

We stress that these bounds in general only provide worst case results. In contrast to the Lasso, where
the penalisation also provides a tight lower bound on the prediction and estimation errors (Huang and
Zhang, 2010), in realistic cases, the distribution of errors amongst the irrelevant groups means that the
weight calculations can be, and usually will be, much better than Theorem 3.1 suggests. Since the Lasso
selects a maximum of $\min(n, p)$ variables, a simple calculation will show that in the best case, under the
conditions of Corollary 3.3, we spread the error in the initial estimate across $\min(n, |\mathcal{G}|)$ groups, resulting
in



O a 



>\S\ 



(1 + Vlog|0||G|-Tin-7 a /miii(n,|0|))' 



with similar results for the other inequalities. 

Let us compare the above bounds to performance bounds for some related methods. In the following,
we focus on the prediction error rate $\|X(\beta - \hat\beta)\|_n^2$. We note that in general similar results can be
obtained for the estimation error.



3.1 Comparison to the Lasso 



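Now, the standard Lasso has a prediction error on the order of

$$\|X(\beta - \hat\beta)\|_n^2 = O\left(\sigma^2 \frac{|S|}{n} \log(p)\right).$$

Under the conditions of Corollary 3.3, however, the Co-adaptive Lasso has a prediction error of

$$\|X(\beta - \hat\beta^{(c)})\|_n^2 = O\left(\sigma^2 \frac{|\bar S|}{n}\left(1 + \sqrt{\log|\mathcal{G}|\,|G|^{-\gamma_1} n^{-\gamma_2}}\right)^2\right).$$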



Hence, when the conditions are satisfied, the Co-adaptive Lasso can outperform the ordinary Lasso,
assuming that the group or sample size is large, and the grouping structure is meaningful in that $\bar S$ is
kept small. Indeed, in the AIAO case, if $n \to \infty$, then the Co-adaptive Lasso attains the oracle rate,
removing the contribution from $p$. Moreover, in this case, examination of the Irrepresentable Condition
(Zhao and Yu, 2006) indicates the Co-adaptive Lasso should be consistent for variable selection under
much weaker design conditions than the Lasso.



If on the other hand the conditions of Corollary 3.2 are satisfied, and \Q\ does not grow exponentially in 
\G\n, then 

X((3-p^) 2 =o(a 2lS ^log\S\ 
In effect, the co-adaptive re-weighting has successfully screened out all of the variables that do not
belong in the same group as the relevant variables. Since often 151 = 0(n), while p can be anything 
up to exponential in n, this can be a great improvement. Indeed, if \S\ = 0(n), we arrive at bounds 
asymptotically within a constant factor of the oracle rate. 

On the other hand, if condition B2 cannot be satisfied, or S is too large relative to S, the Co-adaptive 
Lasso can under-perform. In empirical experiments, though, we see that the Co-adaptive Lasso often 
outperforms the Lasso even with randomly chosen groups - after all, for small enough groups, the Co- 
adaptive Lasso acts similarly to the Adaptive Lasso, which can have performance advantages. 



3.2 Comparison to the Adaptive Lasso 



The usefulness of Theorem 3.1 lies in the dependence on (minGeSns IIAsll) implied in condition B2. In 



contrast, an Adaptive Lasso approach, being equivalent to a Co-adaptive Lasso with group size 1, would 
use min \j3g\ instead. Suppose that the average squared /3 amongst the true S remains constant or at least 
bounded below. Then as group sizes increase, (min<3 e c; ns ||/3g||) = 0(minc satisfying condition 

B2 with 71 = 1. Meanwhile, in many set-ups, the minimum min|/3s| would not increase, but indeed 
decrease, and hence if the initial Lasso gives errors that are larger than this, performance guarantees 
cannot be given. In terms of 72, if the Adaptive Lasso satisfies B2 for any particular 72, it is implied that 
the Co-adaptive Lasso must also satisfy it for that 72. Similar results arise if we focus on a harmonic 
mean based formulation that gives slightly tighter bounds. 

On the other hand, if there is too much within group sparsity, the Adaptive Lasso may be superior, 
because the Co-adaptive Lasso fails to discriminate as strongly between relevant and irrelevant variables 
within groups. 

Consider as an illustrative example the case with evenly sized non-overlapping groups, where of each
relevant group of covariates $G \in \mathcal{G}_{\cap S}$, a subset of size $\Omega(|G|^{\gamma_1})$, $\gamma_1 \in (0, 1]$, have the same fixed non-zero
coefficient value, with the rest being 0. Here, $|\bar S| = |S||G|^{1-\gamma_1}$. Suppose $\sigma^2 |S| \log(p)/n = o(n^{-\gamma_2})$ for
some $\gamma_2 > 0$.

In this case, assuming appropriate conditions are met, the Adaptive Lasso estimate then achieves

$$\|X(\beta - \hat\beta^{(a)})\|_n^2 = O\left(\sigma^2 \frac{|S|}{n}\left(1 + \sqrt{\log(p)\, n^{-\gamma_2}}\right)^2\right).$$

For the Co-adaptive Lasso, we note that $\max_G |G|^{\gamma_1} / \max(\|\beta_G\|^2, \min_{H \in \mathcal{G}_{\cap S}} \|\beta_H\|^2) = O(1)$, so B2 is
satisfied with $\gamma_1, \gamma_2$. If the additional assumptions of Corollary 3.3 are met,

$$\|X(\beta - \hat\beta^{(c)})\|_n^2 = O\left(\sigma^2 \frac{|S||G|^{1-\gamma_1}}{n}\left(1 + \sqrt{\log|\mathcal{G}|\,|G|^{-\gamma_1} n^{-\gamma_2}}\right)^2\right).$$

If $\sqrt{\log(p)\, n^{-\gamma_2}} \to 0$, then the Adaptive Lasso attains the optimal rate of $O(\sigma^2 |S|/n)$ and the
Co-adaptive Lasso cannot improve on this. Otherwise, the co-adaptivisation improves things so long as
$|G|^{1-\gamma_1} = o(\log(p)\, n^{-\gamma_2})$ and $\gamma_1 > 1/2$.

In particular, if the number of non-zero variables in each group $G$ is a fixed proportion of the full group
size, the Co-adaptive Lasso can always attain an optimal rate of $\sigma^2 |S|/n$ if the initial Lasso has $o(1)$
prediction error, something that is not the case with the Adaptive Lasso.



3.3 Comparison to the Group Lasso 



From Lounici et al. (2010), it can be inferred that the Group Lasso produces prediction error bounds of
the form

$$\|X(\beta - \hat\beta)\|_n^2 = O\left(\frac{\sigma^2}{n} \sum_{G \in \mathcal{G}_{\cap S}} \left(|G| + \sqrt{|G| \log(p/|G|)} + \log(p/|G|)\right)\right)$$
$$= O\left(\frac{\sigma^2}{n}\left(|\bar S| + |\mathcal{G}_{\cap S}|\sqrt{|G| \log|\mathcal{G}|} + |\mathcal{G}_{\cap S}| \log|\mathcal{G}|\right)\right)$$

in the case of non-overlapping, evenly sized groups.

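Suppose the conditions of Corollary 3.3 are satisfied. Then the Co-adaptive Lasso attains

$$\|X(\beta - \hat\beta^{(c)})\|_n^2 = O\left(\sigma^2 \frac{|\bar S|}{n}\left(1 + \sqrt{\log|\mathcal{G}|\,|G|^{-\gamma_1} n^{-\gamma_2}}\right)^2\right)$$
$$= O\left(\sigma^2 \frac{|\bar S|}{n}\left(1 + \sqrt{\log|\mathcal{G}|\,|G|^{-\gamma_1} n^{-\gamma_2}} + \log|\mathcal{G}|\,|G|^{-\gamma_1} n^{-\gamma_2}\right)\right)$$
$$= O\left(\frac{\sigma^2}{n}\left(|\bar S| + |\mathcal{G}_{\cap S}|\sqrt{|G| \log|\mathcal{G}|\,|G|^{1-\gamma_1} n^{-\gamma_2}} + |\mathcal{G}_{\cap S}| \log|\mathcal{G}|\,|G|^{1-\gamma_1} n^{-\gamma_2}\right)\right).$$

If $|G|^{1-\gamma_1} n^{-\gamma_2} = o(1)$, the Co-adaptive Lasso is superior to the Group Lasso. In particular, if B2 is
satisfied with $\gamma_1 = 1$, $\gamma_2 > 0$, then the Co-adaptive Lasso attains quickly the $O(\sigma^2 |\bar S|/n)$ rate.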

The Co-adaptive Lasso holds a further advantage if within group sparsity exists. Suppose that the
conditions of Corollary 3.2 are satisfied. Then by that corollary,

X(p- /3 (c) ) 







= [ — max(bg|a||Gf|-^»-»,log|5|) 
n 

•io[ — ( \s\]og\s\ + -^]og\g\nr» 



n 



\G\ 



This can be a substantial improvement if \S\ log \S\ is much smaller than \S\, and |5||G| 7l n 72 is much 
smaller than |^ n s|- In particular, if \S\/n fails to converge, the Co-adaptive Lasso can succeed if \S\ 
diminishes quickly enough, while we cannot usually expect success with the Group Lasso. 

Nevertheless, the Group Lasso can do better if the conditions for the Co-adaptive Lasso are too difficult 
to satisfy, especially if the initial Lasso fails completely to converge, or if the coefficients in each group 
are very small. In particular, the multi-task learning context analysed in Lounici et al. (2010), where a
set of separate regressions are related by a common sparsity pattern, fails to converge for the Lasso as 
the number of tasks alone increases, and so will generally fail with the Co-adaptive Lasso. 



4 Detailed properties 



For our theoretical analysis, let us assume there exists $\beta = \beta_S$ such that the data can be written as

$$Y = X\beta + \epsilon$$

where $S$ is a set of relevant covariates, and $\epsilon$ is a noise term. In our analysis, we assume that the
model is "truly" sparse in the sense that the underlying model $\beta$ has non-zeroes only in $S$. It is possible
to encompass the more general case where $\beta$ is only approximately sparse by using $\epsilon$ to incorporate
approximation error arising from sparsification, if the removed covariates have a very small coefficient.

Our results will be in two halves - first, we show that under a broad condition, the weights w successfully 
separate between covariates from zero groups and covariates from non-zero groups. Then, we show that 
with this separation, we achieve good results on the second optimisation. 



4.1 Convergence of Group Weights 



Existing work (P. J. Bickel, 2009; van de Geer and Bühlmann, 2009) has proven bounds for the estimation
error of the Lasso. We show as a variant bounds on the group-wise errors of the initial estimate, and so
by implication, the weights used in the second estimate.
Lemma 4.1. Let

$$\lambda > 2\max|X'\epsilon/n|.$$

Then for all $G \in \mathcal{G}$,

$$\|\hat\beta_G - \beta_G\| / \sqrt{|G|} \le \frac{2(\lambda + \max|X'\epsilon/n|)\sqrt{|S|}}{\phi_g(3, S)\,\phi(3, S)\sqrt{|G|}}.$$



From Lemma 4.1, the contribution of the noise $\epsilon$ then is based on the maximal correlation $\max|X'\epsilon/n|$.
As other authors have identified (van de Geer and Bühlmann, 2009), it is possible to bound this for
normally distributed noise:

Lemma 4.2. Let $X \in \mathbb{R}^{n \times k}$, with $\|X^{(j)}\|$ bounded above by some constant $C$ for all $j$, and $\epsilon \sim N(0, \sigma^2)$.
Then with probability exceeding

exp(-t/2) 



V^F(t + 2 log(fc))' 



max|X'e/«l < Ca 



t + 21og(fc) 



A similar lemma will work in the case of more general subgaussian noise, albeit with different constants 
in the bound. 
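A small simulation can show the kind of behaviour Lemma 4.2 describes. The sketch below assumes a design with ±1 entries (so every column has empirical norm 1) and compares $\max|X'\epsilon/n|$ with a threshold of the form $\sigma\sqrt{(t + 2\log k)/n}$. That threshold is the standard Gaussian tail and union bound form and is an assumption here; the constants in the lemma itself may differ. The function names are hypothetical.

```python
import math
import random


def max_abs_correlation(X, eps):
    """max_j |X_j' eps / n| for an n-by-k design given as a list of rows."""
    n, k = len(X), len(X[0])
    return max(abs(sum(X[i][j] * eps[i] for i in range(n))) / n for j in range(k))


def exceed_frequency(n, k, sigma, t, n_rep=200, seed=0):
    """Fraction of simulated draws in which max|X'eps/n| exceeds
    sigma * sqrt((t + 2 log k) / n), the ASSUMED form of the threshold."""
    rng = random.Random(seed)
    bound = sigma * math.sqrt((t + 2.0 * math.log(k)) / n)
    hits = 0
    for _ in range(n_rep):
        # random sign design: each column has empirical norm exactly 1
        X = [[rng.choice((-1.0, 1.0)) for _ in range(k)] for _ in range(n)]
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        if max_abs_correlation(X, eps) > bound:
            hits += 1
    return hits / n_rep


# Larger t makes the threshold looser, so exceedances become rarer.
for t in (0.0, 2.0, 4.0):
    print(t, exceed_frequency(n=50, k=20, sigma=1.0, t=t))
```

The threshold grows only logarithmically in the number of columns, which is why the tuning parameter can be chosen of order $\sqrt{\log(p)/n}$ even when $p$ is large.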

Using the above, we have the following: 

Lemma 4.3. Suppose that A1, A2 and B1 hold and $\mathcal{G}$ is non-overlapping. Choose $\lambda = O(\max|X'\epsilon/n|)$
with $\lambda > 2\max|X'\epsilon/n|$. Then, for all $\eta > 0$, there exists $C$ such that with probability exceeding $1 - \eta$
we have for each $G \in \mathcal{G}$,



\G\ 



< a (o-yj 5 : 



where the inequalities are taken term-wise. 



Proof. Result follows trivially from combining Lemma 
for j e G. 



4.1 



i\G\ J 



and Lemma 



4.2 



noting that 1 = 



(4.1) 



/\G\ 
□ 



4.2 Second stage convergence 

To translate the bounds on the weights into bounds on the second stage, we require some theorems for
the weighted Lasso. Several authors have proven a variety of results relating to this. In particular,
van de Geer et al. [2010] proved some inequalities similar in spirit to ours. However, their focus was on
convergence for the Adaptive Lasso, with the additional complication of model misspecification. The
Co-adaptive Lasso provides a distinct challenge in that we expect faster weight convergence for the variables
in the out of group set $\bar{S}^c$, and slow or non-existent weight convergence in the case of irrelevant variables
within groups.

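Lemma 4.4. Results for the second stage

Fix any $T$, $\bar{S} \supseteq T \supseteq S$, such that $\hat\beta^{OLS,T} \in \mathbb{R}^p$, the ordinary least squares estimate when computed on
the restricted covariate set $T$, exists:
\[ \hat\beta^{OLS,T}_{T^c} = 0, \qquad \hat\beta^{OLS,T}_T = (X_T' X_T)^{-1} X_T' Y, \]
with residual $\epsilon^{OLS} = Y - X\hat\beta^{OLS,T}$.

Let $\mu^{\epsilon_{OLS}} = \max|X'\epsilon^{OLS}|/n$, and $\tilde\mu^{\epsilon_{OLS}} = \max|X_{\bar{S}\setminus T}'\epsilon^{OLS}|/n$.
For $\mu > \max\left(\mu^{\epsilon_{OLS}}/w^-_{\bar{S}^c},\ \tilde\mu^{\epsilon_{OLS}}/w^-_{\bar{S}\setminus T}\right)$, setting
\[ L_1 = \frac{\mu w_T^+}{\mu w^-_{\bar{S}^c} - \mu^{\epsilon_{OLS}}}, \qquad L_2 = \frac{\mu w_T^+}{\mu w^-_{\bar{S}\setminus T} - \tilde\mu^{\epsilon_{OLS}}}, \]
we have that
\[ \|X(\hat\beta^{OLS,T} - \hat\beta^{(c)})\|_n \le \frac{2\mu w_T^+\sqrt{|T|}}{\max\left(\phi(L_1,\bar{S}),\ \phi(\max(L_1,L_2),T)\right)}, \tag{4.2} \]
\[ \|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\| \le \frac{2\mu w_T^+\sqrt{|T|}}{\max\left(\phi^2(L_1,\bar{S}),\ \phi^2(\max(L_1,L_2),T)\right)}, \tag{4.3} \]
\[ \|\hat\beta^{OLS,T} - \hat\beta^{(c)}\| \le \frac{2\mu w_T^+\sqrt{|T|}\,(1 + \max(L_1,L_2))}{\max\left(\phi^2(L_1,\bar{S},|\bar{S}| + |T|),\ \phi^2(\max(L_1,L_2),T,2|T|)\right)}. \tag{4.4} \]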



Remark 4.1. Note that in all of these theorems, a somewhat better bound and conditions can be obtained
by using the Cauchy-Schwarz bound to replace $w_T^+\sqrt{|T|}$ with [...] and $w^-$ with [...] in both the
inequalities and the calculation of $L_1$ and $L_2$. This can improve things in the case where a small number
of groups in the signal $\beta$ are significantly smaller in terms of group-wise L2 norm than the rest, and the
number of groups is large. However, in this paper, we use the former bound for simplicity, as the latter
bound leads to conditions on sparsity-weighted harmonic means of group-wise L2 norms that are more
difficult to interpret.



The effect of Lemma 4.4 can be examined by varying $T$. Setting $T$ to equal $\bar{S}$, we see that as the weights
on the out of group variables increase relative to the ones on the relevant groups, we can obtain a fast
convergence to the ordinary least squares estimate restricted to variables in the same group as the true
ones, assuming that this exists. We do this by selecting a tuning parameter that scales in a manner
inversely proportional to the weights on the out of group variables. In other words, by paying the price
of ultimately just doing a least squares regression on $X_{\bar{S}}$, we remove the influence of the remaining
variables very easily. This case is especially useful in the AIAO case, or if the group sizes are quite small.

Meanwhile, setting $T$ to equal $S$ implies that by choosing a higher tuning parameter than before, we can
attempt to take advantage of the within group sparsity as well. At best, we can obtain a result similar to
conducting a Lasso restricted to the relevant groups $\bar{S}$, a substantial improvement if $|\bar{S}| \ll p$. We pay a
price in this case in terms of the ratio $w_T^+/w^-_{\bar{S}\setminus T}$, which can be quite significant if the impact of the relevant
groups varies greatly.

The conditions required for these convergences offer two possibilities. One is to bound $\phi(L_1,\bar{S})$ away
from zero, which becomes increasingly easy as the weights ratio between relevant and irrelevant groups
increases, since $L_1$ decreases. However, this requires that $\bar{S}$ be not too large, and especially be smaller
than $n$, or else the loss of identifiability means the condition automatically fails. The second condition
of bounding $\phi(\max(L_1,L_2),T)$ allows reasonable success in the case of large $\bar{S}$, potentially larger than
$n$, something distinct from the Group Lasso. Its fulfilment is more complex - if the weights are the same
within each group, $L_2 \ge 1$. Indeed, $L_2$ can be quite large if the contributions of relevant groups are quite
variable, making this condition harder to satisfy and so Lemma 4.4 harder to apply. We can compensate
however by increasing $T$ to incorporate elements of $\bar{S} \setminus S$ that are likely to have small values of $w$. In
other words, we can still have good results, if we are willing to accept that we will likely incorrectly select
some irrelevant covariates in groups that appear collectively very relevant from the initial calculation.



Our main result then emerges by combining this theorem with the convergence results from Section 4.1.



5 Overlapping groups 

In general, groups in the case of group sparsity cannot be assumed to be non-overlapping. However,
the existence of overlaps amongst groups poses a problem to the practitioner as to how to handle these
overlaps. It is then necessary to tailor the algorithm to deal with overlapping groups in the appropriate
fashion.

For the Group Lasso penalty $P_{group}(\beta) = \sum_G \|\beta_G\|$, the derivative
\[ \frac{\partial}{\partial\beta_i} P_{group}(\beta) = \sum_{G \ni i} \frac{\beta_i}{\|\beta_G\|} \]
has a singularity at $\beta_i = 0$ if and only if there exists a group $G$ with $i \in G$ and $\|\beta_G\| = 0$. As the
presence of singularities indicates 'corners' in the penalty for which exactly zero estimates are possible,
we have that allowable sparsity patterns take the form of intersections of complements of overlapping
groups, with coefficients in overlaps between groups especially unlikely.



While this is useful in some cases [Huang et al., 2009], alternative interpretations of group overlaps might
be more desirable in others. In the general case, a typical application is to find signals that are unions
of groups. Jacob et al. [2009] propose one way to accommodate this situation. However, Percival [2011]
established that there are several problems with this formulation - the computational cost can be very
burdensome, and the conditions required for success are stringent, with generally poor results if the
structure of the groups is too complex. Particular examples raised were nested group structures, and the
presence of sparsity in true groups.

However, in the co-adaptive framework, a flexible and natural alternative can be constructed
to deal with overlapping groups. For example, we can replicate a Group Lasso style behaviour for the
Co-adaptive Lasso by calculating weights as, for each $j = 1, \ldots, p$,
\[ w_j = \sum_{G \ni j} \sqrt{|G|}/\|\hat\beta_G\|. \]
For behaviour involving selecting unions of groups, a variety of methods for choosing weights are possible,
and we suggest here two possibilities for various scenarios:



5.1 Overlapping group norm minimisation 



One approach to improving the performance of the overlapping Group Lasso, suggested in Percival [2011],
is to calculate the adaptive overlapping Group Lasso. Specifically, Percival [2011] suggests using the OLS
estimate $\hat\beta^{OLS}$ to calculate group weights, with, for some $\gamma < 0$, $w_G = \|v_G\|^{\gamma}$ with
\[ \{v_G\} = \arg\min_{v} \sum_G \|v_G\| \quad \text{s.t. } \mathrm{supp}(v_G) \subseteq G \ \ \forall G, \quad \sum_G v_G = \hat\beta^{OLS}. \]
Under this, and assuming some conditions - in particular, that the above decomposition is unique in a
neighbourhood around the true value, and the overlap norm minimising decomposition of the
true value itself is tight around the true relevant covariates - they are able to prove consistency in the
fixed design case where $n$ alone goes to infinity.

Such a strategy may be similarly applied to the Co-adaptive Lasso. For example, by conducting a
weighted Lasso using for each covariate the minimal weight amongst groups containing the covariate,
the same proof given in Percival [2011] will suffice to show the same consistency result under the same
conditions. A second modification would be to use the initial Lasso estimate in place of $\hat\beta^{OLS}$, which
would allow use when $p > n$.

On the other hand, group norm minimisation weights have a variety of shortcomings. As discussed in
Percival [2011], the uniqueness criterion is strong, and rules out possibilities like nested groups. One
potential improvement would be to apply the overlapping group norm instead as a penalty, treating the
weight calculation as a Group Lasso problem itself, using the initial estimate $\hat\beta$ to replace $Y$, and $X$ as an
identity matrix with repeated columns for variables appearing in more than one group, and the same
grouping structure. Here the minimisation result corresponds to the case where the tuning parameter is
taken to zero.

Assume that the decomposition of the true $\beta$ that minimises the overlapping norm gives weights $w_{0,G}$.
Using an application of Proposition 1 in Percival [2011], and an appropriately chosen tuning parameter
value, we can then obtain a bound
\[ |\hat{w}_G^{\gamma} - w_{0,G}^{\gamma}| \le 32 \max_{G \in \mathcal{G}} [\ldots] / \kappa^2, \]
where $1/\kappa^2 \in [|\mathcal{G}_S|/16,\ 4|\mathcal{G}|]$ is a constant depending only on the complexity of $\mathcal{G}$. While this approach
can be effective, choosing the correct tuning parameter may be difficult, and the approach may not be
useful for many types of grouping structure.



5.2 Maximum covariate-wise group norm 



A simple alternative approach is to take the initial estimate, calculate the grouped L2 norms $\|\hat\beta_G\|$ for
each group $G$, and weight each covariate $j$ as
\[ w_j = \min_{G \ni j} \sqrt{|G|}/\|\hat\beta_G\|. \]
This approach reduces to the previous case when the groups do not overlap. Lemma 4.1 applies equally
to this weight formulation, and we have then, by an analogous proof to Lemma 4.3, the following:



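Lemma 5.1. Suppose that A1, A2 and B1 hold. Choose $\lambda = O(\max|X'\epsilon/n|)$ with $\lambda \ge 2\max|X'\epsilon/n|$.
Then for all $\eta > 0$, there exists $C$ such that with probability exceeding $1 - \eta$ we have for each $j \in \{1, \ldots, p\}$,
\[ \left| \frac{\|\hat\beta_{G_j}\|}{\sqrt{|G_j|}} - \frac{\|\beta_{G_j}\|}{\sqrt{|G_j|}} \right| \le C\sigma\sqrt{\frac{|S|\log(p)}{n|G_j|}}, \qquad G_j = \arg\max_{G \ni j} \|\hat\beta_G\|/\sqrt{|G|}. \]

We can then adapt Theorem 3.1 to use this lemma instead of Lemma 4.3, deriving similar bounds.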
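The weight rule at the start of this subsection can be sketched as follows. This is an illustrative helper rather than the authors' code: the name `max_group_norm_weights` is hypothetical, groups are given as lists of 0-based covariate indices, and two conventions are assumed here, namely that a group whose initial estimate is entirely zero contributes an infinite weight and that a covariate belonging to no group keeps an infinite weight.

```python
import math

def max_group_norm_weights(beta_init, groups):
    """w_j = min over groups G containing j of sqrt(|G|) / ||beta_init[G]||."""
    w = [math.inf] * len(beta_init)
    for G in groups:
        norm = math.sqrt(sum(beta_init[j] ** 2 for j in G))
        w_G = math.sqrt(len(G)) / norm if norm > 0.0 else math.inf
        for j in G:
            w[j] = min(w[j], w_G)  # keep the smallest weight over groups containing j
    return w

# Covariate 1 lies in two overlapping groups and takes the smaller weight.
weights = max_group_norm_weights([3.0, 4.0, 0.0, 0.0, 0.0], [[0, 1], [1, 2, 3], [4]])
```

When the groups do not overlap, each covariate belongs to exactly one group and the minimum is taken over a single term.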

The problem with this approach is that the lightly penalised covariates - that is, the variables $j$ for
which $w_j \not\to \infty$ as $\hat\beta \to \beta$ - will correspond to $\bigcup_{G \cap S \ne \emptyset} G$, which can be potentially very large, especially if
there exist large groups that overlap significantly with $S$. In that case, the co-adaptive re-weighting will
eliminate an insufficient proportion of the variables, leading to a poor fit. Note that this
is a problem similarly present with the overlapping Group Lasso.

One possible way to fix this issue is if we know or can infer the degree of overlap of groups with $S$, or
whether there exists a covering set of groups for which there is very little within group sparsity. Then,
we can compute instead of $\|\hat\beta_G\|$ appropriately trimmed or winsorised sums of $\hat\beta_G$. By this method,
we reduce the mistakenly picked up overlapping groups to only those with large overlaps with the truly
relevant ones, paying the price of potentially losing covariate groups that have very high levels of within
group sparsity.



6 Empirical results 

We shall investigate the performance of the Co-adaptive Lasso in a variety of datasets, both real and 
generated. 



6.1 Splice site prediction 



As considered in Meier et al. [2008], the splice problem concerns the prediction of splice sites - the regions
between coding (exons) and non-coding (introns) DNA segments. In particular, the task of predicting
'donor' splice sites - the 5' splice site end of introns - has been commonly used to demonstrate the Group
Lasso [Meier et al., 2008, Roth and Fischer, 2008].

As in Meier et al. [2008], we consider the MEMset dataset [Yeo and Burge, 2008], which consists of
sub-sequences of DNA that contain the consensus position "GT", which are either true donor splice sites, or
otherwise. Removing the consensus position then gives sequences of length 7 with 4 levels ({A, C, G, T}).
The original dataset consists of 8415 true and 170438 false human donor sites, with an additional test set
of 4208 true and 89717 false donor sites. As per the original, we use the training set to build a smaller
balanced training dataset with 5610 true and 5610 false donor sites, and an unbalanced validation set
with the remainder, having the same ratio of true to false sites as the test set. This is done through
sampling at random, without replacement.

Our goal is to accurately predict whether a candidate site from the test set is a true or a false donor splice
site. This is measured by the maximum attainable correlation coefficient [Yeo and Burge, 2008] between
a predicted vector and the true one,
\[ \mathrm{cor}_{\max} = \max_{t} \mathrm{cor}\left(y_{test},\ 1\{\hat{p}(x_{test}) > t\}\right). \]
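The maximum attainable correlation defined above can be computed by scanning candidate thresholds. The sketch below is one possible reading rather than a reference implementation: it assumes that the threshold ranges over the observed predicted probabilities, that the indicator uses a strict inequality, and that a constant prediction vector is given correlation zero. The names `pearson` and `cor_max` are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant (a convention)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0.0 or sbb == 0.0:
        return 0.0
    return sab / math.sqrt(saa * sbb)

def cor_max(y_true, p_hat):
    """Largest correlation between y_true and the thresholded predictions 1{p_hat > t}."""
    best = -1.0
    for t in sorted(set(p_hat)):
        pred = [1.0 if p > t else 0.0 for p in p_hat]
        best = max(best, pearson(y_true, pred))
    return best
```

Because only the ranking of the predicted probabilities matters for this criterion, it does not depend on the intercept of a logistic fit.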



We consider dummy variables as in Meier et al. [2008], using scaled versions of the indicator variables of
({A, C, G}), plus all interactions up to three way. We train on the balanced dataset, using the validation
set to choose tuning parameters - unlike Meier et al. [2008], we select by maximising $\mathrm{cor}_{\max}$ on
the validation set instead of maximising likelihood. This allows us to dispense with the need to correct
the intercept.

We compare logistic versions of the Lasso, the Group Lasso, the Adaptive Lasso, the Adaptive Group
Lasso, and the Co-adaptive Lasso. For the two stage procedures, we use the same training and validation
set for each stage of the procedure.

[Plot: $\mathrm{cor}_{\max}$ on the test set against sparsity (axis marked from 100 to 300), with one curve each for Co-adaptive, Lasso, Adaptive, Group and Group adaptive.]

Figure 6.1: Sparsity level vs $\mathrm{cor}_{\max}$ on the test set. The point on each curve represents the model chosen
by reference to the validation set. The chosen model for the Group Lasso uses 966 variables and so is
omitted from the graph for presentation reasons.



Algorithm               cor_max
Co-adaptive Lasso       0.659
Lasso                   0.656
Adaptive Lasso          0.656
Group Lasso             0.660
Adaptive Group Lasso    0.664

[Plot: $\mathrm{cor}_{\max}$ on the test set against group sparsity.]

Figure 6.2: Group sparsity level vs $\mathrm{cor}_{\max}$ on the test set. The point on each curve represents the model
chosen by reference to the validation set.

Overall, the attained $\mathrm{cor}_{\max}$ is highly similar for all algorithms. The Adaptive Group Lasso achieves the
best results, with the Group Lasso and Co-adaptive Lasso having similar results. The Adaptive Lasso
and Lasso perform less well, but still reasonably. These results suggest there is not much within group
sparsity on this dataset - the high level of correlation (which, on the training set, reached a maximum of
0.88) means there is some advantage in selecting redundant variable sets to reduce variability on the test
set.

More interesting is the breakdown of performance according to group sparsity and sparsity. Figure 6.1
and Figure 6.2 give these results. We see that while the Group Lasso and Adaptive Group Lasso perform
well overall, they require a much larger set of selected variables to do so. The Lasso and the Adaptive
Lasso meanwhile select a small number of variables, but fail to attain good results at any
level of group sparsity. The Co-adaptive Lasso therefore manages to attain an excellent trade-off, by
attaining much better within group sparsity, while still remaining competitive with the Adaptive Lasso
in achieving good group sparse solutions. Indeed, for the most group sparse solutions, it even beats the
Group Lasso.

Similar results were obtained when indicator variables including 'T' were used to create an intentionally 
over-specified design matrix. In our experiments, the Lasso based algorithms were the fastest, though the 
majority of time was consumed with handling the large dataset. It is likely that an online formulation 
would have been more efficient. 



6.2 Artificial datasets — Non-overlapping 

We set up a series of simulation experiments to test the Co-adaptive Lasso against a range of similar
algorithms. In particular, we compare against

• The Lasso [Tibshirani, 1996]
• The Adaptive Lasso [Zou, 2006]
• The Group Lasso [Meier et al., 2008]
• The Sparse Group Lasso [Friedman et al., 2010]

               Scenario 1      Scenario 2      Scenario 3      Scenario 4      Scenario 5
               Const   Norm    Const   Norm    Const   Norm    Const   Norm    Const   Norm
Lasso          2.81    2.11    16.2    7.64    2.84    1.88    2.10    1.74    0.57    0.50
Adaptive       1.35    1.14    15.87   5.48    1.38    1.11    1.08    1.01    0.21    0.20
Group Lasso    1.09    1.09    2.25    2.09    8.45    7.50    0.62    0.58    2.12    1.93
SGL            1.84    1.39    10.05   5.09    2.09    1.42    1.47    1.20    0.41    0.34
Co-adaptive    0.55    0.53    3.51    1.21    1.35    1.40    0.37    0.34    0.36    0.39

Table 1: Results for simulated datasets. The table shows estimation error. The best performer in each
scenario is highlighted in bold.



We generate a range of scenarios showing group sparsity and varying levels of within group sparsity in
the linear regression context. Specifically, we generate
\[ Y = X\beta + \epsilon, \]
with $X$ as an $n \times p$ matrix from an i.i.d. standard normal distribution. Then, choosing non-overlapping
groups with group size $|G|$, we choose $S$ as random subsets of 2 selected groups, with the same within
group sparsity level in each group. We repeat each scenario with the final selected variables $\beta_S$ as either
independently standard normal or constant at 1. Finally, we generate the noise $\epsilon$ as independent normal,
with variance chosen so that each iteration would have a SNR of 2.
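The data-generating recipe just described can be sketched as below. Several details are assumptions, since the text does not pin them down: the SNR is taken to be the empirical variance of $X\beta$ divided by the noise variance, $p$ is assumed divisible by the group size, the $|S|$ active covariates are split evenly across the two selected groups, and the function name `simulate` and its arguments are hypothetical.

```python
import math
import random

def simulate(n, p, group_size, n_active, signal="const", snr=2.0, seed=0):
    """Draw one dataset with two active groups and an equal number of active covariates in each."""
    rng = random.Random(seed)
    groups = [list(range(s, s + group_size)) for s in range(0, p, group_size)]
    active_groups = rng.sample(groups, 2)
    per_group = n_active // 2  # same within group sparsity level in both groups
    S = sorted(j for G in active_groups for j in rng.sample(G, per_group))
    beta = [0.0] * p
    for j in S:
        beta[j] = 1.0 if signal == "const" else rng.gauss(0, 1)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    mean = [sum(row[j] * beta[j] for j in S) for row in X]
    m = sum(mean) / n
    var_signal = sum((v - m) ** 2 for v in mean) / n
    sigma = math.sqrt(var_signal / snr)  # assumed definition: SNR = var(X beta) / sigma^2
    y = [v + rng.gauss(0, sigma) for v in mean]
    return X, y, beta, S, groups, sigma
```

Under these assumptions, a call such as `simulate(150, 2000, 10, 20)` gives a setting in which both selected groups are fully active, while `simulate(150, 2000, 10, 10)` leaves half of each selected group at zero.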

Our scenarios are: 



1. $n = 150$, $p = 2000$, $|G| = 10$, $|S| = 10$
2. $n = 150$, $p = 2000$, $|G| = 10$, $|S| = 20$
3. $n = 150$, $p = 2000$, $|G| = 100$, $|S| = 10$
4. $n = 500$, $p = 2000$, $|G| = 10$, $|S| = 20$
5. $n = 500$, $p = 2000$, $|G| = 100$, $|S| = 10$

Hence, Scenarios 2 and 4 are AIAO situations, while the remaining scenarios have some degree of within
group sparsity.

We conduct 100 iterations of each scenario, and compute estimates choosing tuning parameters by 10
fold cross validation. In the case of the SGL, which has 2 parameters, we use the default implementation
which fixes the mixing parameter $\alpha$ at 0.95. We compare the estimation error $\|\hat\beta - \beta\|_2$; given the
independence of our covariates, this is equivalent to comparing the expected error on a new test set,
minus the contribution from the new $\epsilon$. Table 1 shows the results of our simulations.
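The equivalence between estimation error and expected test error can be made explicit with a short calculation. It assumes a new observation $(x, y)$ with $x \sim N(0, I_p)$ and $y = x'\beta + \epsilon$, drawn independently of the data used to compute $\hat\beta$, with $E[\epsilon] = 0$ and $E[\epsilon^2] = \sigma^2$; these symbols are introduced here only for illustration.

```latex
\begin{align*}
E\left[(y - x'\hat\beta)^2 \mid \hat\beta\right]
  &= E\left[\left(x'(\beta - \hat\beta) + \epsilon\right)^2 \mid \hat\beta\right] \\
  &= (\beta - \hat\beta)'\,E[xx']\,(\beta - \hat\beta) + E[\epsilon^2] \\
  &= \|\hat\beta - \beta\|_2^2 + \sigma^2 .
\end{align*}
```

The cross term vanishes because $\epsilon$ has mean zero and is independent of $x$, and $E[xx'] = I_p$ for independent standard normal covariates, so ranking methods by estimation error is the same as ranking them by expected test error.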

From these results, the Co-adaptive Lasso performs well in all of the scenarios. In most, it performs the 
best, or nearly the best, with the exceptions of scenario 5 and scenario 2, constant version. In the former 
case, understandably it performs less well than the Adaptive Lasso because the groups in this case are 
not very informative, while the sample size is large so the initial Lasso provides good weights for the 
adaptive second stage. In the latter case, there is no within group sparsity, and the inherent L2 penalty 
of the Group Lasso helps encourage it to make an estimate that is more constant in magnitude within 
groups, which matches the true signal. In comparison, the SGL, which is the other algorithm attempting 
within group sparsity, performs roughly in between the Group Lasso and the Lasso. It needs to be noted 
that this behaviour was observed using the default values for the second tuning parameter - more success 
might be reached by selecting this, for instance, through a grid based cross validation. However such an 
approach would necessarily greatly increase the time required for computation.

               Const   Norm
Lasso          4.3     3.5
Adaptive       2.5     2.3
Group Lasso    1.4     1.4
Co-adaptive    1.1     1.2

Table 2: Results for simulated datasets. The table shows estimation error. The best performer in each
scenario is highlighted in bold.



In terms of computational time, the Lasso based methods were much faster than the Group Lasso or 
SGL. However, this may be due to the specific implementation of the algorithms. 



6.3 Artificial datasets — Overlapping 



We conduct a second set of simulations with overlapping groups. Here, we compare the Co-adaptive
Lasso with the minimum group-wise norm weights to

• The Lasso [Tibshirani, 1996]
• The Adaptive Lasso [Zou, 2006]
• The Overlapping Group Lasso [Jacob et al., 2009]

In this case, we generate again
\[ Y = X\beta + \epsilon, \]
with $n = 500$, $p = 2000$. We choose $S = \{1, \ldots, 20\}$, and choose as two scenarios $\beta_S$ either standard
normal or constant at 1. We generate $\epsilon$ as before to be independent normal with variance chosen to give
SNR of 2.

We give the data a more complex group structure: $\mathcal{G} = \{G_1, \ldots, G_{100}, G_{101}, \ldots, G_{120}\}$, with
\[ G_i = \{20(i-1)+1, \ldots, 20(i-1)+20\}, \quad \text{for } i = 1, \ldots, 100, \]
\[ G_i = \{1, \ldots, 100\} \times (i - 100), \quad \text{for } i = 101, \ldots, 120. \]

In this setup, each of $G_1, \ldots, G_{100}$ overlaps with each of $G_{101}, \ldots, G_{120}$, though only $G_1$ is required to
cover the signal. We conduct again 100 iterations of these scenarios, using 10 fold cross validation to
select tuning parameters and comparing the estimation error. Table 2 gives the results.

Once again, the Co-adaptive Lasso is superior in both cases. We note that this happens despite the fact
that $\bar{S}$ in this case is the entire set $\{1, \ldots, p\}$. This is likely because in this scenario, even though the
groups $G_{101}, \ldots, G_{120}$ overlap with $S$, the average signal on them $\|\beta_G\|/\sqrt{|G|}$ is much smaller than for
$G_1$.



7 Discussion 



We have defined a fast calculating method of conducting variable selection under a group structure, that 
facilitates the use of within group sparsity. The Co-adaptive Lasso can be easily coded using any existing 
methods of calculating the standard Lasso, including online computation methods. We have proven some 
convergence properties for the method, and illustrated its competitiveness relative to several state of the 
art methods. The procedure may be applied to a range of contexts. 

Several areas warrant further investigation. The Co-adaptive Lasso with its framework of re-weighting 
Lassos can be fairly easily extended to include more complex re-weighting schemes. These may allow 
the utilisation of more complex subject specific prior information. For instance, in All-In-All-Out cases
where the within group sparsity is low and the signal is spread out relatively evenly amongst members
of each group, robust statistics related methods of weight computation have worked well to screen out 
single-variable mistakes made by the initial Lasso calculation. We have suggested this in the case of 
overlapping groups, though this can be useful more generally. 

Several previous examinations of concepts similar to the Co-adaptive Lasso have focused on repeating 
the procedure until convergence. The general idea of those procedures is to attain local convergence 
to the optimum of some implied concave penalised maximum likelihood problem, but because these
penalised maximum likelihood problems characteristically have multiple optima, it is not clear whether
this would imply good properties, despite the necessary increase in computational cost. Moreover, more
complex re-weighting procedures fall out of this paradigm, because there can sometimes be no compatible 
implied penalty function. Further investigation may go into when these iterative procedures can improve 
performance. 



A Appendix: Proofs of theorems 

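A.1 Proof of Lemma 2.2

Proof. Let $\delta$ be any vector satisfying
\[ \frac{\|X\delta\|_n^2}{\|\delta_S\|^2} = \phi^2(L,S), \qquad \|\delta_{S^c}\|_1 \le L\sqrt{|S|}\,\|\delta_S\|. \]
Assume without loss of generality that $\|X\delta\|_n = 1$. Then by Definition 2.2, for each $H \in \mathcal{H}$, $\|\delta_H\|^2 \le 1/\phi_g^2(L,S)$.
Hence
\[ 1/\phi^2(L,S) = \|\delta_S\|^2 \le \sum_{H \in \mathcal{H}} \|\delta_H\|^2 \le |\mathcal{H}|/\phi_g^2(L,S). \]
Similarly, if $\delta$ attains $\phi_g^2(L,S)$ in Definition 2.2, assuming $\|X\delta\|_n = 1$ means by definition that
\[ 1/\phi_g^2(L,S) \le \max_{G} \|\delta_G\|^2 \le \max_{M} \|\delta_M\|^2 \le 1/\phi^2(L, S, |S| + \max|G|). \]
Finally, if $\delta$ satisfies
\[ \min_M \left\{ \frac{\|X\delta\|_n^2}{\|\delta_M\|^2} : S \subseteq M,\ |M| \le |S| + \max|G| \right\} = \phi^2(L, S, |S| + \max|G|), \qquad \|\delta_{S^c}\|_1 \le L\sqrt{|S|}\,\|\delta_S\|, \]
assuming $\|X\delta\|_n = 1$ means for some $M$, $|M \setminus S| = \max|G|$, $\|\delta_M\|^2 = 1/\phi^2(L, S, |S| + \max|G|)$.
Then
\begin{align*}
\|\delta_S\|^2 &= \|\delta_M\|^2 - \|\delta_{M \setminus S}\|^2 \\
&\ge \|\delta_M\|^2 - \|\delta_{M \setminus S}\|_1^2 \\
&\ge \|\delta_M\|^2 - L^2|S|\,\|\delta_S\|^2.
\end{align*}
So
\[ \frac{1}{\phi^2(L,S)} \ge \|\delta_S\|^2 \ge \frac{\|\delta_M\|^2}{1 + L^2|S|} \ge \frac{1}{\phi^2(L, S, |S| + \max|G|)\,(1 + L^2|S|)}. \]
□

A.2 Proof of Lemma 4.1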



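Proof. Similarly to Lemma 4.4, we have that
\begin{align*}
\lambda\|\beta\|_1 - \lambda\|\hat\beta^{(l)}\|_1
&\ge \frac{1}{2}\left(\|Y - X\hat\beta^{(l)}\|_n^2 - \|Y - X\beta\|_n^2\right) \\
&= \frac{1}{2}\|X(\hat\beta^{(l)} - \beta)\|_n^2 - (\epsilon, X(\hat\beta^{(l)} - \beta))_n \\
&\ge \frac{1}{2}\|X(\hat\beta^{(l)} - \beta)\|_n^2 - \max|X'\epsilon/n|\,\|\hat\beta^{(l)}_{S^c}\|_1 - \max|X'\epsilon/n|\,\|\hat\beta^{(l)}_S - \beta_S\|_1.
\end{align*}
Hence
\[ \frac{1}{2}\|X(\hat\beta^{(l)} - \beta)\|_n^2 + (\lambda - \max|X'\epsilon/n|)\,\|\hat\beta^{(l)}_{S^c}\|_1 \le (\lambda + \max|X'\epsilon/n|)\,\|\hat\beta^{(l)}_S - \beta_S\|_1. \]
So by definition of $\phi(L,S)$, $\phi_g(L,S)$, we have by a similar argument to Lemma 4.4 that
\[ \|\hat\beta^{(l)}_G - \beta_G\| \le \frac{2(\lambda + \max|X'\epsilon/n|)\sqrt{|S|}}{\phi(L,S)\,\phi_g(L,S)} \]
with
\[ L = \frac{\lambda + \max|X'\epsilon/n|}{\lambda - \max|X'\epsilon/n|} \le \frac{3\max|X'\epsilon/n|}{\max|X'\epsilon/n|} = 3. \]
As $\phi(L,S)$, $\phi_g(L,S)$ decrease with $L$, the rest follows.

□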



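A.3 Proof of Lemma 4.2

Proof. Similar proofs have appeared elsewhere. We present the proof here for convenience.

Now, $X^{(j)\prime}\epsilon/\|X^{(j)}\|$ are distributed identically (but possibly not independently) Normal with variance $\sigma^2$.
Therefore setting $\delta = \sqrt{(t + 2\log(k))/n}$, we have
\begin{align*}
P\left(\|X'\epsilon\|_\infty/n > \delta C\sigma\right)
&\le k\,P\left(|X^{(j)\prime}\epsilon|/(\|X^{(j)}\|\sigma) > \sqrt{n}\,\delta\right) \\
&\le k\,\frac{\exp(-n\delta^2/2)}{\sqrt{\pi n}\,\delta} \\
&\le \frac{\exp(-t/2)}{\sqrt{\pi(t + 2\log(k))}}.
\end{align*}

□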



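A.4 Proof of Lemma 4.4

Proof. Our proofs are similar in spirit to those in van de Geer et al. [2010].
By definition of the weighted Lasso, we have that
\begin{align*}
\mu\left(\|w_T\hat\beta^{OLS,T}_T\|_1 - \|w\hat\beta^{(c)}\|_1\right)
&\ge \frac{1}{2}\left(\|Y - X\hat\beta^{(c)}\|_n^2 - \|Y - X\hat\beta^{OLS,T}\|_n^2\right) \\
&= \frac{1}{2}\|X(\hat\beta^{(c)} - \hat\beta^{OLS,T})\|_n^2 - (\epsilon^{OLS}, X(\hat\beta^{(c)} - \hat\beta^{OLS,T}))_n \\
&\ge \frac{1}{2}\|X(\hat\beta^{(c)} - \hat\beta^{OLS,T})\|_n^2 - \mu^{\epsilon_{OLS}}\|\hat\beta^{(c)}_{\bar{S}^c}\|_1 - \tilde\mu^{\epsilon_{OLS}}\|\hat\beta^{(c)}_{\bar{S}\setminus T}\|_1,
\end{align*}
since $X_T'\epsilon^{OLS} = 0$.
Hence,
\begin{align*}
\frac{1}{2}\|X(\hat\beta^{OLS,T} - \hat\beta^{(c)})\|_n^2
&\le \mu\left(\|w_T\hat\beta^{OLS,T}_T\|_1 - \|w_T\hat\beta^{(c)}_T\|_1\right)
 - (\mu w^-_{\bar{S}^c} - \mu^{\epsilon_{OLS}})\|\hat\beta^{(c)}_{\bar{S}^c}\|_1
 - (\mu w^-_{\bar{S}\setminus T} - \tilde\mu^{\epsilon_{OLS}})\|\hat\beta^{(c)}_{\bar{S}\setminus T}\|_1 \\
&\le \mu w_T^+\|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\|_1
 - (\mu w^-_{\bar{S}^c} - \mu^{\epsilon_{OLS}})\|\hat\beta^{(c)}_{\bar{S}^c}\|_1
 - (\mu w^-_{\bar{S}\setminus T} - \tilde\mu^{\epsilon_{OLS}})\|\hat\beta^{(c)}_{\bar{S}\setminus T}\|_1.
\end{align*}
Let $\mu > \max\left(\mu^{\epsilon_{OLS}}/w^-_{\bar{S}^c},\ \tilde\mu^{\epsilon_{OLS}}/w^-_{\bar{S}\setminus T}\right)$. Setting
\[ L_1 = \frac{\mu w_T^+}{\mu w^-_{\bar{S}^c} - \mu^{\epsilon_{OLS}}}, \qquad L_2 = \frac{\mu w_T^+}{\mu w^-_{\bar{S}\setminus T} - \tilde\mu^{\epsilon_{OLS}}}, \]
we have then that
\[ \sqrt{|T|}\,\|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\| \ge \|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\|_1 \ge \frac{1}{L_1}\|\hat\beta^{(c)}_{\bar{S}^c}\|_1 + \frac{1}{L_2}\|\hat\beta^{(c)}_{\bar{S}\setminus T}\|_1, \]
so that
\[ \max(L_1,L_2)\sqrt{|T|}\,\|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\| \ge \|\hat\beta^{(c)}_{T^c}\|_1, \]
and by a similar argument
\[ L_1\sqrt{|\bar{S}|}\,\|\hat\beta^{OLS,T}_{\bar{S}} - \hat\beta^{(c)}_{\bar{S}}\| \ge \|\hat\beta^{(c)}_{\bar{S}^c}\|_1. \]
By definition of $\phi(\max(L_1,L_2),T)$ and $\phi(L_1,\bar{S})$, observing in the latter case that
$\|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\| \le \|\hat\beta^{OLS,T}_{\bar{S}} - \hat\beta^{(c)}_{\bar{S}}\|$,
\[ \frac{1}{2}\|X(\hat\beta^{OLS,T} - \hat\beta^{(c)})\|_n^2 \le \mu w_T^+\sqrt{|T|}\,\frac{\|X(\hat\beta^{OLS,T} - \hat\beta^{(c)})\|_n}{\max\left(\phi(L_1,\bar{S}),\ \phi(\max(L_1,L_2),T)\right)}. \]
Hence
\begin{align*}
\|X(\hat\beta^{OLS,T} - \hat\beta^{(c)})\|_n &\le \frac{2\mu w_T^+\sqrt{|T|}}{\max\left(\phi(L_1,\bar{S}),\ \phi(\max(L_1,L_2),T)\right)}, \\
\|\hat\beta^{OLS,T}_T - \hat\beta^{(c)}_T\| &\le \frac{2\mu w_T^+\sqrt{|T|}}{\max\left(\phi^2(L_1,\bar{S}),\ \phi^2(\max(L_1,L_2),T)\right)}, \\
\|\hat\beta^{OLS,T} - \hat\beta^{(c)}\| &\le \frac{2\left(1 + \max(L_1,L_2)\sqrt{|T|}\right)\mu w_T^+\sqrt{|T|}}{\max\left(\phi^2(L_1,\bar{S}),\ \phi^2(\max(L_1,L_2),T)\right)}. \tag{A.5}
\end{align*}
The third inequality can be quite poor for large $|T|$. On the other hand, suppose that $\phi(L_1,\bar{S},|\bar{S}| + |T|)$
or $\phi(\max(L_1,L_2),T,2|T|) > 0$. Then for any $M \supseteq T$, $|M \setminus T| \le |T|$, such that $\min_{j \in M\setminus T}|\hat\beta^{(c)}_j| \ge \max_{j \in M^c}|\hat\beta^{(c)}_j|$,
we have that
\[ \|\hat\beta^{OLS,T}_M - \hat\beta^{(c)}_M\| \le \frac{2\mu w_T^+\sqrt{|T|}}{\max\left(\phi^2(L_1,\bar{S},|\bar{S}| + |T|),\ \phi^2(\max(L_1,L_2),T,2|T|)\right)}. \]
From Lemma 2.2 of van de Geer and Bühlmann [2009], however, we have that
\[ \|\hat\beta^{(c)}_{M^c}\| \le \max(L_1,L_2)\,\|\hat\beta^{OLS,T}_M - \hat\beta^{(c)}_M\|, \]
so
\[ \|\hat\beta^{OLS,T} - \hat\beta^{(c)}\| \le \frac{2\mu w_T^+\sqrt{|T|}\,(1 + \max(L_1,L_2))}{\max\left(\phi^2(L_1,\bar{S},|\bar{S}| + |T|),\ \phi^2(\max(L_1,L_2),T,2|T|)\right)}. \]

□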



A.5 Proof of Theorem 3.1

Proof. Fix a $T \subseteq \bar{S}$ for which the conditions are satisfied, and choose any $t > 0$ so that $3\exp(-t/2)/\sqrt{\pi} < \eta$.
Assume for now that $T \ne \bar{S}$. Writing $P = I - X_T(X_T'X_T)^{-1}X_T'$,
\[ \mu^{\epsilon_{OLS}} = \max|X'P\epsilon|/n = \max|(PX)'\epsilon|/n, \qquad \tilde\mu^{\epsilon_{OLS}} = \max|(PX_{\bar{S}})'\epsilon|/n, \qquad \lambda_\epsilon = \max|X'\epsilon|/n. \]
But $P$ is a projection matrix, so for all $X^{(j)}$, $\|PX^{(j)}\| \le \|X^{(j)}\|$. Therefore by Lemma 4.2, under
Assumption B1, each of the following fails to occur with probability less than $\exp(-t/2)/\sqrt{\pi}$:

„.„ / , < lt + 2lo g]S \ =0 l logl^j 



Further, by Lemma 4.3, taking $\eta$ there to be less than $\exp(-t/2)/\sqrt{\pi}$, Conditions A1-2 and B1 imply then
that choosing $\lambda = 2\lambda_\epsilon$, there exists some $C > 0$, such that with probability exceeding $1 - \exp(-t/2)/\sqrt{\pi}$,



w Sc > mm C crW — 

S GGS^s I V n \ G \ 



|5|log(p) 



n|G| 



With Condition B2, for all $C'' > 0$, there exists $n_0$ so that for all $n > n_0$, with the same probability,



Wj* < max 



> min 



GeGns \\Pg\\ - C" \\j3 G \\ y/\G\-Kn-n 
W\ 



s\t ~ G G s n(gw ||/3 G || + C" \\(3 G \\ y/\G\-nn-i*
w~ >



min Gee c s y/\G\^n^ 



S'~ C"™in H& g ns \\p H \ 



Assume then for the remainder of this proof that this is the case, together with the previous bounds on
$\mu^{\epsilon_{OLS}}$ and $\tilde\mu^{\epsilon_{OLS}}$. We see that this is true with probability greater than $1 - 3\exp(-t/2)/\sqrt{\pi} > 1 - \eta$.

Choosing $\mu = 2\max\left(\mu^{\epsilon_{OLS}}/w^-_{\bar{S}^c},\ \tilde\mu^{\epsilon_{OLS}}/w^-_{\bar{S}\setminus T}\right)$ gives, by the definitions in Lemma 4.4, that it is
simultaneously the case that for all $n > n_0$,



< 2/j,w£/[iw~ c = 2w^/w~ 



< max 



W\G\ 



C" 



GeGns,H£G c ns 1 - C"y/\G\-nn-y* y/\Hy-+-nni2 



< 2nw+/nw~ T = 2w+/wg 



J/3 H \\ y/\G\l + C"y/\H\-^n-y 

< max 2 —. —^^=^^= . 

Geg n s,H€G n( s XT) \\Pg\\ VW\ 1 - C"y/\G\-nn-n 



Hence, for any value of $\delta > 0$, as $|G|, n \to \infty$, eventually



/ \G\ 

L\ < max S\ , TTM , 



Similarly, for all $\delta > 0$, with $|G|$, $|H|$, $n$ sufficiently large, it is the case that

L 2 < , max 2(1 + 6)^^ 



GeGns.HEG 



Therefore under condition C1(a) or condition C2(a), we have that there exists $C$ such that either
$\phi(L_1,\bar{S}) > C$ or $\phi(\max(L_1,L_2),T) > C$. Hence by Lemma 4.4,



X 



(pOLS,T <2^w+^\f\/C 



< 2cry , ^max ( 2yfogb)-^,2v/log|^|-^ ) /C 



W S\T 



\T 



O [ cry — max I ^\og(p) max 



log | S\ max 



GeGns,H£G c ns ^{H^+^n^ 

fa\\y/W\\ 



Geg ns ,H£G n( s XT) \\Pg\\ y/\H\ 



IT 



O [ cry — max I y / fog(p) max 



log | S\ max 



GeGns,HGG c ns ^/|iJ|l+7l n 72 

h\\VW\\\ 



GeGns.HeG ni s XT) v/|i/| 



Similarly, under condition C1(b) or C2(b), we have that there exists $C$ such that either $\phi(L_1,\bar{S},|\bar{S}| + |T|) > C$
or $\phi(\max(L_1,L_2),T,2|T|) > C$. Then, again by Lemma 4.4,

pOLS,T _ £(c)



= I T \; in; n\ 
n 



Vlogm) max — . — , 



log|S| 



max 



n(s\T) HPG 



l#l 



(1 + max(Li, L 2 )) 



Now, under condition C1 or C2, the smallest singular value of $X_T/\sqrt{n}$ remains bounded away from zero.
Therefore $\|X(\hat\beta^{OLS,T} - \beta)\|_n = O(\sigma\sqrt{|T|/n})$, and $\|\hat\beta^{OLS,T} - \beta\| = O(\sigma\sqrt{|T|/n})$, so using the triangle
rule gives the result.

If $T = \bar{S}$, we can proceed with the proof as normal, simply omitting the terms relating to $\bar{S} \setminus T$ and
condition C2.

□



A.6 Proof of Corollary 3.2 and Corollary 3.3

Proof. Simply apply Theorem 3.1, observing that given the condition on $|G|$, $|H|$, $L_1 \to 0$. In the case
of Corollary 3.3, condition C1(a) is thus implied by A1. We have then that in the second lemma,

(j8 " ^ (c) ) f = O ( a 2 M(i + ^\og(p)\G\-^n-^) 2 \ . 



X 



But with non-overlapping groups with proportional sizes, p = 0(\Q\\G\), so 



log( P )\G\-
0(logQ\G\-^ +log\G\\G\-^)

o(io g a|G|-^ + i).

Corollary 3.2 follows similarly.

□